2.1 A Structural Model For the Johnson & Johnson data, say \(y_t\), let \(x_t = \log(y_t)\). In this problem, we are going to fit a special type of structural model, \(x_t = T_t + S_t + N_t\), where \(T_t\) is a trend component, \(S_t\) is a seasonal component, and \(N_t\) is noise. In our case, time \(t\) is in quarters (1960.00, 1960.25, . . . ) so one unit of time is a year.

  1. Fit the regression model: \[ x_t = \beta t + \alpha_1Q_1(t) + \alpha_2Q_2(t) + \alpha_3Q_3(t) + \alpha_4Q_4(t) + w_t \] where \(Q_i(t) = 1\) if time \(t\) corresponds to quarter \(i = 1, 2, 3, 4\), and zero otherwise. The \(Q_i(t)\)'s are called indicator variables. We will assume for now that \(w_t\) is a Gaussian white noise sequence.
y_t <- jj                  # quarterly Johnson & Johnson earnings per share (astsa)
x_t <- log(y_t)
Q <- factor(cycle(x_t))    # quarter 1-4 extracted from the ts cycle
lm.out <- lm(x_t ~ time(x_t) + Q + 0,na.action = NULL)
sum1 <- summary(lm.out)
sum1
## 
## Call:
## lm(formula = x_t ~ time(x_t) + Q + 0, na.action = NULL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29318 -0.09062 -0.01180  0.08460  0.27644 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## time(x_t)  1.672e-01  2.259e-03   74.00   <2e-16 ***
## Q1        -3.283e+02  4.451e+00  -73.76   <2e-16 ***
## Q2        -3.282e+02  4.451e+00  -73.75   <2e-16 ***
## Q3        -3.282e+02  4.452e+00  -73.72   <2e-16 ***
## Q4        -3.284e+02  4.452e+00  -73.77   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1254 on 79 degrees of freedom
## Multiple R-squared:  0.9935, Adjusted R-squared:  0.9931 
## F-statistic:  2407 on 5 and 79 DF,  p-value: < 2.2e-16
  1. If the model is correct, what is the estimated average annual increase in the logged earnings per share?
cat("The average annual increase in the logged earnings per share is ",round(coef(sum1)[1,1],2)," per year",sep='')
## The average annual increase in the logged earnings per share is 0.17 per year
  1. If the model is correct, does the average logged earnings rate increase or decrease from the third quarter to the fourth quarter? And, by what percentage does it increase or decrease?

In any given year, logged earnings for the fourth quarter are:

\[ x_{t,4} = \beta t + \alpha_4 + w_t \] while third quarter earnings are: \[ x_{t,3} = \beta t + \alpha_3 + w_t \] The difference between these is: \[ x_{t,4}-x_{t,3} = \alpha_4-\alpha_3 \]

cat("The average logged earnings per share ",ifelse(coef(sum1)[5,1] > coef(sum1)[4,1],'increases','decreases')," by ",round(abs(coef(sum1)[5,1]-coef(sum1)[4,1]),3),sep='')
## The average logged earnings per share decreases by 0.269

Because the series is logged, a drop of 0.269 in log earnings corresponds to a decrease of about \(100(1 - e^{-0.269}) \approx 24\%\) from the third quarter to the fourth quarter.
  1. What happens if you include an intercept term in the model in (a)? Explain why there was a problem.

There is perfect collinearity between the dummy variables and the intercept, so there is no unique OLS solution. R therefore drops one of the dummies; quarter 1 becomes the reference quarter.
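The collinearity can be seen directly in the design matrix. A minimal sketch on a toy quarterly factor (not the J&J data):

```r
# With an intercept plus all four quarterly dummies, the intercept
# column equals the sum of the dummy columns, so the design matrix
# is rank-deficient and OLS has no unique solution.
Q <- factor(rep(1:4, times = 5))                 # 20 toy quarterly observations
X <- cbind(Intercept = 1, model.matrix(~ Q + 0)) # intercept + Q1..Q4 dummies

all(rowSums(X[, 2:5]) == X[, 1])                 # TRUE: exact linear dependence
qr(X)$rank                                       # 4, although X has 5 columns
```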

lm.out2 <- lm(x_t ~  time(x_t) + Q ,na.action = NULL)
sum2 <- summary(lm.out2)
sum2
## 
## Call:
## lm(formula = x_t ~ time(x_t) + Q, na.action = NULL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29318 -0.09062 -0.01180  0.08460  0.27644 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.283e+02  4.451e+00 -73.761  < 2e-16 ***
## time(x_t)    1.672e-01  2.259e-03  73.999  < 2e-16 ***
## Q2           2.812e-02  3.870e-02   0.727   0.4695    
## Q3           9.823e-02  3.871e-02   2.538   0.0131 *  
## Q4          -1.705e-01  3.873e-02  -4.403 3.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1254 on 79 degrees of freedom
## Multiple R-squared:  0.9859, Adjusted R-squared:  0.9852 
## F-statistic:  1379 on 4 and 79 DF,  p-value: < 2.2e-16
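Note that the two parameterizations describe the same fitted model; only the coding of the seasonal effects changes. A quick check on simulated quarterly data (hypothetical values, not the J&J series):

```r
# Cell-means coding (no intercept) vs. reference coding (with intercept):
# the fitted values agree, and each contrast Qi equals alpha_i - alpha_1.
set.seed(1)
t <- 1:40
Q <- factor(rep(1:4, 10))
y <- 0.15 * t + rep(c(1.2, 1.3, 1.4, 1.0), 10) + rnorm(40, sd = 0.1)

m_cell <- lm(y ~ t + Q + 0)   # coefficients: one level per quarter
m_ref  <- lm(y ~ t + Q)       # coefficients: quarter 1 is the reference

all.equal(fitted(m_cell), fitted(m_ref))                      # TRUE
all.equal(unname(coef(m_cell)["Q2"] - coef(m_cell)["Q1"]),
          unname(coef(m_ref)["Q2"]))                          # TRUE
```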
  1. Graph the data, \(x_t\) , and superimpose the fitted values, say \(\hat{x_t}\), on the graph. Examine the residuals, \(x_t − \hat{x_t}\) , and state your conclusions. Does it appear that the model fits the data well (do the residuals look white)?

First we plot the original series and the fitted values:

df <- data.frame(time = time(x_t),fitted=predict(lm.out),resid=lm.out$residuals)
p <- autoplot.zoo(x_t)
p + geom_line(aes(x=time,y=fitted),data=df,color='red')

ggplot(aes(x=fitted,y=resid),data=df) + geom_point() + geom_smooth(se=F)

The model does not appear to fit the data very well, particularly at larger fitted values, where the residuals fan out. The residuals do not look like white noise.

2.2 For the mortality data examined in Example 2.2:

  1. Add another component to the regression in (2.21) that accounts for the particulate count four weeks prior; that is, add \(P_{t−4}\) to the regression in (2.21). State your conclusion.
temp = tempr - mean(tempr)   # center temperature, as in Example 2.2
tempsq = temp^2
lm.out1 <- dynlm(cmort ~ time(cmort) + temp + tempsq + part)
lm.out2 <- dynlm(cmort ~ time(cmort) + temp + tempsq + part + L(part,4))   # L(part, 4) is P_{t-4}
sum1 <- summary(lm.out1)
sum2 <- summary(lm.out2)
print(sum1)
## 
## Time series regression with "ts" data:
## Start = 1970(1), End = 1979(40)
## 
## Call:
## dynlm(formula = cmort ~ time(cmort) + temp + tempsq + part)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.0760  -4.2153  -0.4878   3.7435  29.2448 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.831e+03  1.996e+02   14.19  < 2e-16 ***
## time(cmort) -1.396e+00  1.010e-01  -13.82  < 2e-16 ***
## temp        -4.725e-01  3.162e-02  -14.94  < 2e-16 ***
## tempsq       2.259e-02  2.827e-03    7.99 9.26e-15 ***
## part         2.554e-01  1.886e-02   13.54  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.385 on 503 degrees of freedom
## Multiple R-squared:  0.5954, Adjusted R-squared:  0.5922 
## F-statistic:   185 on 4 and 503 DF,  p-value: < 2.2e-16
print(sum2)
## 
## Time series regression with "ts" data:
## Start = 1970(5), End = 1979(40)
## 
## Call:
## dynlm(formula = cmort ~ time(cmort) + temp + tempsq + part + 
##     L(part, 4))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.228  -4.314  -0.614   3.713  27.800 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.808e+03  1.989e+02  14.123  < 2e-16 ***
## time(cmort) -1.385e+00  1.006e-01 -13.765  < 2e-16 ***
## temp        -4.058e-01  3.528e-02 -11.503  < 2e-16 ***
## tempsq       2.155e-02  2.803e-03   7.688 8.02e-14 ***
## part         2.029e-01  2.266e-02   8.954  < 2e-16 ***
## L(part, 4)   1.030e-01  2.485e-02   4.147 3.96e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.287 on 498 degrees of freedom
## Multiple R-squared:  0.608,  Adjusted R-squared:  0.6041 
## F-statistic: 154.5 on 5 and 498 DF,  p-value: < 2.2e-16
print(AIC(lm.out1))
## [1] 3332.282
print(AIC(lm.out2))
## [1] 3291.52
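For reference, the AIC values that R reports for these Gaussian regressions can be reproduced by hand from the residual sum of squares; a sketch on toy data (not the mortality series):

```r
# For a Gaussian linear model, AIC = -2*logLik + 2*(p + 1),
# where p counts the regression coefficients and the +1 counts sigma^2.
set.seed(1)
x <- 1:50
y <- 2 + 0.3 * x + rnorm(50)
fit <- lm(y ~ x)

n   <- length(y)
rss <- sum(resid(fit)^2)
p   <- length(coef(fit))
aic_manual <- n * (log(2 * pi) + log(rss / n) + 1) + 2 * (p + 1)

all.equal(aic_manual, AIC(fit))   # TRUE
```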

Adding the lagged particulate count substantially reduces the AIC (3291.5 vs. 3332.3), so the second model is selected. Looking at plots of the fitted values against the residuals:

df1 <- data.frame(fitted1=predict(lm.out1),resid1=lm.out1$residuals)
df2 <- data.frame(fitted2=predict(lm.out2),resid2=lm.out2$residuals)
p <- ggplot(aes(x=fitted1,y=resid1),data=df1)
p  + geom_point() + geom_smooth(se=F)

p %+% df2 + aes(x=fitted2,y=resid2) +  geom_point() + geom_smooth(se=F)

The plots give similar results, showing a good fit except for points at larger fitted values.

  1. Draw a scatterplot matrix of \(M_t\), \(T_t\), \(P_t\), and \(P_{t−4}\) and then calculate the pairwise correlations between the series. Compare the relationship between \(M_t\) and \(P_t\) versus \(M_t\) and \(P_{t−4}\).
df3 <- ts.intersect(cmort, part4 = stats::lag(part,-4), temp, part)   # stats::lag(part, -4) gives P_{t-4}; a positive k would give a lead
df3 %>% pairs()

The scatterplots show positive, roughly linear relationships of \(M_t\) with both \(P_t\) and \(P_{t-4}\).

cor(df3)

The correlation between \(M_t\) and \(P_{t-4}\) is somewhat stronger than that between \(M_t\) and \(P_t\), suggesting that particulate counts are associated with mortality about a month later.

2.3 In this problem, we explore the difference between a random walk and a trend stationary process.

  1. Generate four series that are random walk with drift of length \(n = 100\) with \(\delta = .01\) and \(\sigma_w = 1\). Call the data \(x_t\) for \(t = 1,\dots,100\). Fit the regression \(x_t = \beta t + w_t\) using least squares. Plot the data, the true mean function (i.e., \(\mu_t = .01t\)) and the fitted line, \(\hat{x}_t = \hat{\beta} t\), on the same graph.
delta <- 0.01
create_four_drift <- function(x){
  w = cumsum(rnorm(100))
  return(ts(delta*1:100 + w))
}
series <- lapply(1:4,create_four_drift)
create_four_plots1 <- function(x){
  p <- autoplot.zoo(x)
  return(p + geom_abline(slope=0.01) + geom_smooth(method='lm',formula=y~x-1,se=F))
}
plots <- lapply(series,create_four_plots1)
marrangeGrob(plots,ncol=2,nrow=2)

  1. Generate four series of length \(n = 100\) that are linear trend plus noise, say \(y_t = .01t + w_t\), where \(t\) and \(w_t\) are as in part (a). Fit the regression \(y_t = \beta t + w_t\) using least squares. Plot the data, the true mean function (i.e., \(\mu_t = .01 t\)) and the fitted line, \(\hat{y}_t = \hat{\beta}t\), on the same graph.
create_four_lin <- function(x){
  w = rnorm(100)
  return(ts(delta*1:100 + w))
}
series2 <- lapply(1:4,create_four_lin)
create_four_plots2 <- function(x){
  p <- autoplot.zoo(x)
  return(p + geom_abline(slope=0.01) + geom_smooth(method='lm',formula=y~x-1,se=F))   # no-intercept fit, matching y_t = beta*t
}
plots2 <- lapply(series2,create_four_plots2)
marrangeGrob(plots2,ncol=2,nrow=2)
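The difference between the two cases shows up in the variability of the fitted slopes: under a random walk with drift the least-squares slope does not concentrate around \(\delta\), while under trend plus noise it does. A small Monte Carlo sketch:

```r
# Compare the spread of fitted slopes across many replications of each process.
set.seed(1)
n <- 100; delta <- 0.01; reps <- 500
t <- 1:n
slope <- function(x) coef(lm(x ~ t - 1))[1]   # no-intercept fit, as in parts (a)-(b)

rw_slopes    <- replicate(reps, slope(delta * t + cumsum(rnorm(n))))  # random walk + drift
trend_slopes <- replicate(reps, slope(delta * t + rnorm(n)))          # trend + white noise

sd(rw_slopes)      # much larger: the fitted "trend" mostly tracks the particular walk
sd(trend_slopes)   # small: estimates concentrate near delta = .01
```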

2.8 The glacial varve record plotted in Figure 2.7 exhibits some nonstationarity that can be improved by transforming to logarithms and some additional nonstationarity that can be corrected by differencing the logarithms.

  1. Argue that the glacial varves series, say \(x_t\), exhibits heteroscedasticity by computing the sample variance over the first half and the second half of the data. Argue that the transformation \(y_t = \log x_t\) stabilizes the variance over the series. Plot the histograms of \(x_t\) and \(y_t\) to see whether the approximation to normality is improved by transforming the data.
data("varve")
x_t <- varve
first <- 1:(length(x_t)/2)
first_xt <- x_t[first]
second_xt <- x_t[-first]
print(var(first_xt))
## [1] 133.4574
print(var(second_xt))
## [1] 594.4904
cat("The variance in the first half is ",round(var(first_xt)/var(second_xt),3)," times the variance in the second half\n",sep='')
## The variance in the first half is 0.224 times the variance in the second half
y_t <- log(x_t)
first_yt <- y_t[first]
second_yt <- y_t[-first]
print(var(first_yt))
## [1] 0.2707217
print(var(second_yt))
## [1] 0.451371
cat("The variance in the first half is ",round(var(first_yt)/var(second_yt),3)," times the variance in the second half\n",sep='')
## The variance in the first half is 0.6 times the variance in the second half

After the log transformation the ratio is much closer to one, so \(y_t = \log x_t\) stabilizes the variance over the series.
df <- data.frame(xt = as.numeric(varve),logxt = log(as.numeric(varve)))
p <- df %>% ggplot(aes(xt))
p + geom_histogram()

p %+% aes(logxt) + geom_histogram()

  1. Plot the series \(y_t\). Do any time intervals, of the order 100 years, exist where one can observe behavior comparable to that observed in the global temperature records in Figure 1.2?
autoplot.zoo(y_t)

There appears to be cyclical behavior, with a cycle every 500 years or so.

  1. Examine the sample ACF of \(y_t\) and comment.
ggAcf(y_t)

The sample ACF decays very slowly, which is consistent with the trending, nonstationary behavior seen in the time plot.
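As a reminder, the sample ACF that ggAcf plots is \(\hat\rho(h) = \hat\gamma(h)/\hat\gamma(0)\); a hand computation on toy data matches R's acf():

```r
# Sample lag-1 autocorrelation computed from the definition and via acf().
set.seed(1)
x <- rnorm(30)
n <- length(x); h <- 1
xbar <- mean(x)
r_manual <- sum((x[1:(n - h)] - xbar) * (x[(1 + h):n] - xbar)) /
            sum((x - xbar)^2)

all.equal(r_manual, acf(x, plot = FALSE)$acf[h + 1])   # TRUE
```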

  1. Compute the difference \(u_t = y_t − y_{t−1}\), examine its time plot and sample ACF, and argue that differencing the logged varve data produces a reasonably stationary series. Can you think of a practical interpretation for \(u_t\)? Hint: Recall Footnote 1.2.
ut <- diff(y_t)     # u_t = y_t - y_{t-1}; note stats::lag(y_t, k = 1) would give the lead y_{t+1}
autoplot.zoo(ut)
ggAcf(ut)

The time plot of \(u_t\) is stable about zero and the ACF dies out quickly, with a single significant negative spike at lag 1, so the differenced series looks reasonably stationary. Since \(u_t = \log(x_t/x_{t-1})\), it can be interpreted as the approximate year-to-year percentage change in varve thickness.

2.9 In this problem, we will explore the periodic nature of \(S_t\) , the SOI series displayed in Figure 1.5.

  1. Detrend the series by fitting a regression of \(S_t\) on time t. Is there a significant trend in the sea surface temperature? Comment.
data(soi)
lm.out <- lm(soi ~ time(soi),na.action = NULL)
detrend <- lm.out$residuals
summary(lm.out)
## 
## Call:
## lm(formula = soi ~ time(soi), na.action = NULL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.04140 -0.24183  0.01935  0.27727  0.83866 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.70367    3.18873   4.298 2.12e-05 ***
## time(soi)   -0.00692    0.00162  -4.272 2.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3756 on 451 degrees of freedom
## Multiple R-squared:  0.0389, Adjusted R-squared:  0.03677 
## F-statistic: 18.25 on 1 and 451 DF,  p-value: 2.359e-05
p1 <- autoplot.zoo(detrend) 
p2 <- autoplot.zoo(soi)
grid.arrange(p1,p2)

The downward trend (about \(-0.007\) per year) is statistically significant (p \(\approx 2\times 10^{-5}\)), but it is small and explains little of the variation (\(R^2 \approx .04\)); the series is dominated by its cyclical behavior.

2.10 Consider the two weekly time series oil and gas. The oil series is in dollars per barrel, while the gas series is in cents per gallon.

  1. Plot the data on the same graph. Which of the simulated series displayed in Section 1.2 do these series most resemble? Do you believe the series are stationary (explain your answer)?
df1 <- data.frame(time=time(oil),var=oil/42)                  # $/barrel -> $/gallon (42 gallons per barrel)
df2 <- data.frame(time=time(astsa::gas),var=astsa::gas/100)   # cents/gallon -> $/gallon

ggplot() + geom_line(data=df1,aes(x=time,y=var,color='Oil')) + geom_line(data=df2,aes(x=time,y=var,color='Gas'))

These series look like they could be random walk with an upward drift, which are not stationary series.

  1. In economics, it is often the percentage change in price (termed growth rate or return), rather than the absolute price change, that is important. Argue that a transformation of the form \(y_t = \nabla\log(x_t)\) might be applied to the data, where \(x_t\) is the oil or gas price series. Hint: Recall Footnote 1.2.

The series \(y_t\) is: \[ y_t = \log(x_t)-\log(x_{t-1}) = \log\left(\dfrac{x_t}{x_{t-1}}\right) \]
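For small changes, \(\log(x_t/x_{t-1}) = \log(1 + p_t) \approx p_t\), so \(y_t\) is approximately the one-period percentage change. A quick numeric check with hypothetical prices:

```r
# diff(log(x)) versus the exact relative change diff(x) / x_{t-1}
x <- c(100, 102, 101, 105)            # toy weekly prices, not the oil data
logdiff <- diff(log(x))
pctchg  <- diff(x) / head(x, -1)

round(cbind(logdiff, pctchg), 4)      # the columns agree to within ~0.001
```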

  1. Transform the data as described in part (b), plot the data on the same graph, look at the sample ACFs of the transformed data, and comment.
goil <- diff(log(oil))
ggas <- diff(log(astsa::gas))
df1 <- data.frame(time=time(goil),var=goil)
df2 <- data.frame(time=time(ggas),var=ggas)

ggplot() + geom_line(data=df1,aes(x=time,y=var,color='Oil Growth')) + geom_line(data=df2,aes(x=time,y=var,color='Gas Growth'))

The sample ACFs are given by:

acf(goil)

acf(ggas)

The transformed series fluctuate about a constant mean and their ACFs show only small autocorrelations, so the growth rates appear roughly stationary.

  1. Plot the CCF of the transformed data and comment.
cross_corr = ccf(ggas,goil)

The cross-correlations are strong at and near lag zero, indicating the two growth-rate series move together; the small but significant values when gas leads oil might be considered as feedback.

  1. Exhibit scatterplots of the oil and gas growth rate series for up to three weeks of lead time of oil prices; include a nonparametric smoother in each plot and comment on the results (e.g., Are there outliers? Are the relationships linear?).
multiple_lead <- function(i){
  goil.lead <- stats::lag(goil, k = -i)   # oil growth rate i weeks earlier, O_{t-i}
  df <- ts.intersect(goil.lead = goil.lead, ggas = ggas, dframe = TRUE)
  return(ggplot(df, aes(x = goil.lead, y = ggas)) + geom_point() + geom_smooth(se = F))
}
l <- lapply(1:3,multiple_lead)
marrangeGrob(l,ncol=3,nrow=1)

The relationships are positive and only roughly linear, weakening as the lead grows, and there are several clear outliers.

  1. There have been a number of studies questioning whether gasoline prices respond more quickly when oil prices are rising than when oil prices are falling (“asymmetry”). We will attempt to explore this question here with simple lagged regression; we will ignore some obvious problems such as outliers and autocorrelated errors, so this will not be a definitive analysis. Let \(G_t\) and \(O_t\) denote the gas and oil growth rates.

  2. Fit the regression (and comment on the results) \[ G_t = \alpha_1 + \alpha_2 I_t + \beta_1 O_t + \beta_2 O_{t-1} + w_t \] where \(I_t = 1\) if \(O_t \geq 0\) and 0 otherwise (\(I_t\) is the indicator of no growth or positive growth in oil price).

# Note: the problem defines G_t and O_t as growth rates; here the absolute
# weekly price changes are used instead, so the coefficients are on that scale.
goil <- diff(oil)
ggas <- diff(astsa::gas)
lm.out <- dynlm(ggas ~ goil + stats::lag(goil,-1) + I(goil > 0))
summary(lm.out)
## 
## Time series regression with "ts" data:
## Start = 2000(3), End = 2010(25)
## 
## Call:
## dynlm(formula = ggas ~ goil + stats::lag(goil, -1) + I(goil > 
##     0))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.387  -2.601  -0.061   2.939  79.107 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.2089     0.5304  -2.279  0.02304 *  
## goil                   1.7081     0.1585  10.779  < 2e-16 ***
## stats::lag(goil, -1)   0.2076     0.1147   1.810  0.07083 .  
## I(goil > 0)TRUE        2.3204     0.8210   2.826  0.00488 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.786 on 539 degrees of freedom
## Multiple R-squared:  0.3915, Adjusted R-squared:  0.3881 
## F-statistic: 115.6 on 3 and 539 DF,  p-value: < 2.2e-16
coef(summary(lm.out))[,1]
##          (Intercept)                 goil stats::lag(goil, -1) 
##            -1.208896             1.708090             0.207631 
##      I(goil > 0)TRUE 
##             2.320438
  1. What is the fitted model when there is negative growth in oil price at time t? What is the fitted model when there is no or positive growth in oil price? Do these results support the asymmetry hypothesis? The fitted model when the oil growth rate is nonnegative (\(I_t = 1\)) is:

\[ G_t = -1.21 + 2.32 + 1.71O_{t} + 0.21O_{t-1} = 1.11 + 1.71O_{t} + 0.21O_{t-1} \]

The fitted model when the oil growth rate is negative (\(I_t = 0\)) is:

\[ G_t = -1.21 + 1.71O_{t} + 0.21O_{t-1} \]

Yes, these results support the asymmetry hypothesis: the gasoline growth rate depends not only on the level of the oil growth rate, but also on its sign. Nonnegative oil growth shifts the gasoline equation up by 2.32, beyond what the oil growth rate alone would predict.

  1. Analyze the residuals from the fit and comment.
df <- data.frame(resid= lm.out$residuals, fitted = predict(lm.out))
ggplot(df,aes(x=fitted,y=resid)) + geom_point() + geom_smooth(se=F)

The fit is fairly good, except for some outliers.

2.11 Use two different smoothing techniques described in Section 2.3 to estimate the trend in the global temperature series globtemp. Comment.

We use kernel smoothing with a bandwidth of 4 and a smoothing spline with smoothing parameter spar = 0.6.

kern <- ksmooth(time(globtemp),globtemp,'normal',bandwidth=4)$y
sp <- smooth.spline(time(globtemp),globtemp,spar=0.6)$y
df <- data.frame(time=time(globtemp),kernel = kern, spline = sp, temp=globtemp)
ggplot(df,aes(x=time)) + geom_line(aes(y=temp)) + geom_line(aes(y=kern,color='Kernel'),lineend='round') +
  geom_line(aes(y=sp,color='Smoothing Splines'),lineend='round')

The two smoothers track each other closely and bring out the same trend: roughly level temperatures through the middle of the century, then a steady rise from about 1970 onward.
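As a sanity check on the smoothers themselves, here is a sketch on simulated data where the true trend is known (arbitrary tuning values, not the globtemp settings): both methods recover the trend far better than the raw observations do.

```r
# Kernel smoother and smoothing spline applied to noisy observations of sin(t).
set.seed(1)
t <- seq(0, 4 * pi, length.out = 200)
truth <- sin(t)
y <- truth + rnorm(length(t), sd = 0.5)

kern <- ksmooth(t, y, "normal", bandwidth = 1, x.points = t)$y
sp   <- smooth.spline(t, y)$y

rmse <- function(est) sqrt(mean((est - truth)^2))
c(raw = rmse(y), kernel = rmse(kern), spline = rmse(sp))
# both smoothed estimates have much smaller error than the raw series
```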